Route Prediction

Motivation:

This module leverages human knowledge to find the best route. At a given time, due to circumstances such as rush hour, a certain route may be faster than the route shown on the map. People who travel at those times, or who live in those areas, know which route should be taken and when. If every such person entered which path to take based on their experience, we could build a database that helps us predict which path should be taken.

Dataset:

The format of the dataset is as follows (a synthetic example entry is sketched after the list):

  • uid: the unique id of the person making the entry.
  • rating: the person's rating, which adds weight to the validity of the claim (this follows a rating system similar to that of cab companies); it ranges from 0 to 5.
  • metro: 1 indicates the area is a metro city and 0 indicates it isn't.
  • country: the country the route is in. Currently there are three countries: Germany, England, and Russia.
  • startTime: the beginning of the time frame when the particular route should be used.
  • endTime: the end of the time frame when the particular route should be used.
  • oldRoute: the old route given by the map.
  • newRoute: the new route that should be taken.
  • mapUsed: 1 indicates that the map was used, whereas 0 indicates that it was not.
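
To make the schema concrete, here is a minimal sketch of one synthetic entry (all values are invented for illustration; the name column appears in the data shown below even though it is not part of the field list):

    import pandas as pd

    # One made-up entry matching the schema above (values are illustrative only).
    example_entry = pd.DataFrame([{
        'uid': 999,            # unique id of the person making the entry
        'name': 'Jane Doe',    # hypothetical name
        'metro': 1,            # 1 = metro city, 0 = not a metro city
        'country': 'Germany',  # one of Germany, England, Russia
        'rating': 3.5,         # driver rating, 0-5
        'startTime': 17.0,     # beginning of the recommended time frame
        'endTime': 19.0,       # end of the recommended time frame
        'oldRoute': 'R7',      # route suggested by the map
        'newRoute': 'R8',      # route the driver recommends instead
        'mapUsed': 0,          # 1 = the map was used, 0 = it was not
    }])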

Importing all the required packages


In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd
from sklearn import datasets, svm, cross_validation, tree, preprocessing, metrics
import sklearn.ensemble as ske
import tensorflow as tf
from tensorflow.contrib import learn as skflow


C:\Users\Saumya Suvarna\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [2]:
route_df = pd.read_excel('route.xls', index_col=None, na_values=['NA'])

Let's look at the data


In [3]:
route_df.head()


Out[3]:
uid name metro country rating startTime endTime oldRoute newRoute mapUsed
0 101 Elisabeth Walton 1 England 2.541688 17.3 19.3 R1 R1 1
1 102 Hudson Trevor 0 Russia 1.858011 15.0 17.0 R9 R9 1
2 103 Helen Loraine 1 Germany 0.006557 17.0 19.0 R7 R8 0
3 104 Joshua Creighton 1 England 3.832214 17.3 19.3 R1 R1 1
4 105 Bessie Daniels 0 England 3.400957 16.0 17.3 R3 R4 0

Let's look at what percentage of the drivers are using the map. Since mapUsed is binary, its mean gives that fraction directly.


In [4]:
route_df['mapUsed'].mean()


Out[4]:
0.468

About 47% of the drivers are following the map.

Let's look at the groupings by country


In [5]:
route_df.groupby('country').mean()


Out[5]:
uid metro rating startTime endTime mapUsed
country
England 348.426036 0.502959 2.403062 16.653846 18.500592 0.455621
Germany 354.237179 0.506410 2.568901 16.654487 18.506410 0.480769
Russia 349.171429 0.514286 2.487564 14.640000 17.000000 0.468571

Approximately 45-48% of the drivers are using the map in each country, and approximately half of the entries are made in metro cities. Let's plot these values to get a better understanding of the data


In [6]:
country_metro_grouping = route_df.groupby(['country','metro']).mean()
country_metro_grouping


Out[6]:
uid rating startTime endTime mapUsed
country metro
England 0 346.690476 2.526212 16.0 17.300000 0.464286
1 350.141176 2.281361 17.3 19.687059 0.447059
Germany 0 373.961039 2.515311 16.3 18.000000 0.441558
1 335.012658 2.621135 17.0 19.000000 0.518987
Russia 0 353.611765 2.548772 15.0 17.000000 0.423529
1 344.977778 2.429756 14.3 17.000000 0.511111

In [7]:
country_metro_grouping['mapUsed'].plot.bar()


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x2023e6beb00>

1 signifies that it is a metro

Let's visualize the data based on the rating of the drivers


In [8]:
rating_bins = pd.cut(route_df["rating"], np.arange(0, 6, 1))  # bucket ratings into unit-wide bins
rating_grouping = route_df.groupby(rating_bins).mean()
rating_grouping['mapUsed'].plot.bar()


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x2023e876b70>

Most of the drivers have a rating in the 1-2 range

Let's check for missing values by doing a count on each of the columns


In [9]:
route_df.count()


Out[9]:
uid          500
name         500
metro        500
country      500
rating       500
startTime    500
endTime      500
oldRoute     500
newRoute     500
mapUsed      500
dtype: int64

There are no missing values. However, if there were missing values, we could deal with them in the following ways (a toy example follows the list):

  1. If the column (col1) with missing values is an important factor, drop the rows that are missing it:

    route_df = route_df.dropna(subset=['col1']) #drop only the rows where col1 is missing

  2. If the column (col2, col3) is not an important factor, drop the column entirely:

    route_df = route_df.drop(['col2','col3'], axis=1) #drop the columns

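A self-contained toy example of both strategies (the data here is synthetic, purely for illustration):

    import numpy as np
    import pandas as pd

    toy = pd.DataFrame({
        'col1': [1.0, np.nan, 3.0],  # important feature with a missing value
        'col2': ['a', 'b', None],    # less important columns
        'col3': [np.nan, 'y', 'z'],
    })
    kept_rows = toy.dropna(subset=['col1'])       # strategy 1: drop rows missing col1
    slimmed = toy.drop(['col2', 'col3'], axis=1)  # strategy 2: drop the columns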

Now for the preprocessing


In [10]:
def preprocess_route_df(df):
    processed_df = df.copy()
    le = preprocessing.LabelEncoder()
    # encode each string column as integer labels (the encoder is refit for every column)
    processed_df.country = le.fit_transform(processed_df.country)
    processed_df.oldRoute = le.fit_transform(processed_df.oldRoute)
    processed_df.newRoute = le.fit_transform(processed_df.newRoute)
    # name and uid identify the person making the entry, not the route, so drop them
    processed_df = processed_df.drop(['name','uid'], axis=1)
    return processed_df

Here we encode the string columns (country, oldRoute, newRoute) as numeric labels, since scikit-learn estimators require numeric input.
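
As a standalone illustration (not part of the notebook's pipeline), LabelEncoder sorts the distinct strings and maps each to an integer index:

    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    labels = le.fit_transform(['Germany', 'England', 'Russia', 'England'])
    print(labels)       # [1 0 2 0] -- classes are indexed in sorted order
    print(le.classes_)  # ['England' 'Germany' 'Russia']

This is why England, Germany, and Russia appear as 0, 1, and 2 in the processed table below.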

Let's look at the data again


In [11]:
processed_df = preprocess_route_df(route_df)
processed_df


Out[11]:
metro country rating startTime endTime oldRoute newRoute mapUsed
0 1 0 2.541688 17.3 19.3 0 0 1
1 0 2 1.858011 15.0 17.0 5 11 1
2 1 1 0.006557 17.0 19.0 4 10 0
3 1 0 3.832214 17.3 19.3 0 0 1
4 0 0 3.400957 16.0 17.3 2 6 0
5 0 2 3.913199 15.0 17.0 5 1 0
6 1 1 1.961354 17.0 19.0 4 10 0
7 1 1 4.488997 17.0 19.0 4 9 1
8 1 2 1.439535 14.3 17.0 1 2 1
9 0 1 2.305975 16.3 18.0 3 8 0
10 0 0 2.338589 16.0 17.3 2 5 1
11 1 2 1.653544 14.3 17.0 1 2 1
12 1 2 3.311774 14.3 17.0 1 3 0
13 1 1 0.809809 17.0 19.0 4 10 0
14 0 2 2.917041 15.0 17.0 5 1 0
15 0 1 1.334539 16.3 18.0 3 7 1
16 1 1 4.508392 17.0 19.0 4 9 1
17 0 2 1.396008 15.0 17.0 5 11 1
18 0 0 3.639424 16.0 17.3 2 6 0
19 0 1 4.656897 16.3 18.0 3 8 0
20 1 2 1.252721 14.3 17.0 1 2 1
21 0 2 1.696257 15.0 17.0 5 11 1
22 0 0 1.521224 16.0 17.3 2 5 1
23 1 1 3.424627 17.0 19.0 4 9 1
24 0 0 3.044657 16.0 17.3 2 6 0
25 0 2 3.268432 15.0 17.0 5 1 0
26 0 2 3.636979 15.0 17.0 5 1 0
27 0 2 3.008635 15.0 17.0 5 1 0
28 1 0 2.049766 17.3 19.3 0 0 1
29 0 0 0.614796 16.0 17.3 2 5 1
... ... ... ... ... ... ... ... ...
470 0 2 0.762119 15.0 17.0 5 11 1
471 0 2 1.307048 15.0 17.0 5 11 1
472 1 2 2.370201 14.3 17.0 1 2 1
473 0 1 3.077544 16.3 18.0 3 8 0
474 0 2 4.662037 15.0 17.0 5 1 0
475 0 0 3.815969 16.0 17.3 2 6 0
476 0 1 2.109537 16.3 18.0 3 8 0
477 0 1 1.801343 16.3 18.0 3 7 1
478 0 1 2.429583 16.3 18.0 3 8 0
479 1 1 3.827953 17.0 19.0 4 9 1
480 0 2 0.553112 15.0 17.0 5 11 1
481 0 1 3.173730 16.3 18.0 3 8 0
482 0 0 0.712780 16.0 17.3 2 5 1
483 1 1 4.354164 17.0 19.0 4 9 1
484 0 0 3.119144 16.0 17.3 2 6 0
485 0 2 2.934013 15.0 17.0 5 1 0
486 1 1 2.008569 17.0 19.0 4 10 0
487 0 1 3.102782 16.3 18.0 3 8 0
488 0 2 4.865761 15.0 17.0 5 1 0
489 1 0 2.015491 17.3 20.0 0 4 0
490 0 1 4.008516 16.3 18.0 3 8 0
491 1 2 0.008365 14.3 17.0 1 2 1
492 0 2 1.742337 15.0 17.0 5 11 1
493 0 1 1.454144 16.3 18.0 3 7 1
494 1 1 2.889622 17.0 19.0 4 9 1
495 1 1 3.688792 17.0 19.0 4 9 1
496 0 1 3.429030 16.3 18.0 3 8 0
497 0 1 3.839694 16.3 18.0 3 8 0
498 0 2 3.425603 15.0 17.0 5 1 0
499 1 2 0.824645 14.3 17.0 1 2 1

500 rows × 8 columns

X contains all the feature columns (everything except mapUsed), and y contains the target: whether the map was used.


In [12]:
X = processed_df.drop(['mapUsed'], axis=1).values
y = processed_df['mapUsed'].values

In [13]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.2)

Decision tree


In [14]:
clf_dt = tree.DecisionTreeClassifier(max_depth=10)

In [15]:
clf_dt.fit (X_train, y_train)
clf_dt.score (X_test, y_test)


Out[15]:
0.98999999999999999

In [16]:
shuffle_validator = cross_validation.ShuffleSplit(len(X), n_iter=20, test_size=0.2, random_state=0)
def test_classifier(clf):
    scores = cross_validation.cross_val_score(clf, X, y, cv=shuffle_validator)
    print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std()))

In [17]:
test_classifier(clf_dt)


Accuracy: 0.9820 (+/- 0.01)

In [18]:
clf_rf = ske.RandomForestClassifier(n_estimators=50)
test_classifier(clf_rf)


Accuracy: 0.9935 (+/- 0.01)

In [19]:
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
test_classifier(clf_gb)


Accuracy: 0.9825 (+/- 0.01)

In [20]:
eclf = ske.VotingClassifier([('dt', clf_dt), ('rf', clf_rf), ('gb', clf_gb)])
test_classifier(eclf)


Accuracy: 0.9895 (+/- 0.01)

Neural network


In [21]:
#tf_clf_dnn = skflow.TensorFlowDNNClassifier(hidden_units=[20, 40, 20], n_classes=2, batch_size=256, steps=1000, learning_rate=0.05)
feature_columns = [tf.contrib.layers.real_valued_column("")]
tf_clf_dnn = skflow.DNNClassifier(feature_columns=feature_columns, hidden_units=[20, 40, 20], n_classes=2, model_dir="/tmp")
#tf_clf_dnn.evaluate(batch_size=256, steps=1000)
tf_clf_dnn.fit(X_train, y_train, steps=1000)
accuracy_score = tf_clf_dnn.evaluate(X_test, y_test,steps=1000)["accuracy"]
print("\nTest Accuracy: {0:f}\n".format(accuracy_score))


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_tf_random_seed': None, '_task_id': 0, '_save_summary_steps': 100, '_is_chief': True, '_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_num_worker_replicas': 0, '_save_checkpoints_secs': 600, '_environment': 'local', '_keep_checkpoint_every_n_hours': 10000, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002023EAE70B8>, '_save_checkpoints_steps': None, '_num_ps_replicas': 0, '_task_type': None, '_model_dir': None, '_keep_checkpoint_max': 5, '_evaluation_master': ''}
WARNING:tensorflow:From <ipython-input-21-897a40b9cd7b>:5: calling BaseEstimator.fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with y is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
C:\Users\Saumya Suvarna\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:248: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
  equality = a == b
WARNING:tensorflow:From C:\Users\Saumya Suvarna\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\head.py:615: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp\model.ckpt.
INFO:tensorflow:loss = 0.912484, step = 1
INFO:tensorflow:global_step/sec: 82.549
INFO:tensorflow:loss = 0.505164, step = 101 (1.219 sec)
INFO:tensorflow:global_step/sec: 130.634
INFO:tensorflow:loss = 0.281694, step = 201 (0.764 sec)
INFO:tensorflow:global_step/sec: 119.969
INFO:tensorflow:loss = 0.121146, step = 301 (0.835 sec)
INFO:tensorflow:global_step/sec: 137.596
INFO:tensorflow:loss = 0.0692203, step = 401 (0.725 sec)
INFO:tensorflow:global_step/sec: 137.418
INFO:tensorflow:loss = 0.0492737, step = 501 (0.722 sec)
INFO:tensorflow:global_step/sec: 153.874
INFO:tensorflow:loss = 0.0366978, step = 601 (0.645 sec)
INFO:tensorflow:global_step/sec: 196.662
INFO:tensorflow:loss = 0.0277828, step = 701 (0.514 sec)
INFO:tensorflow:global_step/sec: 213.493
INFO:tensorflow:loss = 0.0221953, step = 801 (0.470 sec)
INFO:tensorflow:global_step/sec: 128.621
INFO:tensorflow:loss = 0.0178459, step = 901 (0.775 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /tmp\model.ckpt.
INFO:tensorflow:Loss for final step: 0.0141316.
INFO:tensorflow:Starting evaluation at 2017-07-23-18:24:01
INFO:tensorflow:Restoring parameters from /tmp\model.ckpt-1000
INFO:tensorflow:Evaluation [1/1000]
INFO:tensorflow:Finished evaluation at 2017-07-23-18:24:03
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.99, accuracy/baseline_label_mean = 0.48, accuracy/threshold_0.500000_mean = 0.99, auc = 0.9998, global_step = 1000, labels/actual_label_mean = 0.48, labels/prediction_mean = 0.481941, loss = 0.0285349, precision/positive_threshold_0.500000_mean = 1.0, recall/positive_threshold_0.500000_mean = 0.979167
WARNING:tensorflow:Skipping summary for global_step, must be a float or np.float32.

Test Accuracy: 0.990000
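
The tf.contrib.learn API used above was already deprecated when this ran (as the warnings show) and has since been removed from TensorFlow. For reference, here is a rough tf.keras equivalent of the same 20-40-20 network; this is a sketch assuming TensorFlow 2.x, not the code that produced the output above:

    import tensorflow as tf

    # Same hidden-layer sizes as the DNNClassifier above, with a sigmoid
    # output for the binary mapUsed label.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation='relu'),
        tf.keras.layers.Dense(40, activation='relu'),
        tf.keras.layers.Dense(20, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train.astype('float32'), y_train, epochs=50, verbose=0)
    _, test_accuracy = model.evaluate(X_test.astype('float32'), y_test, verbose=0)
    print("Test Accuracy: {0:f}".format(test_accuracy))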